    A preliminary study in zero anaphora coreference resolution for Polish

    Zero anaphora is an element of the coreference resolution task that has not yet been directly addressed for Polish; most studies have left it, as the most challenging aspect, for further investigation. This article presents an initial study of the problem. We discuss the preparation of a machine learning approach, with features engineered on the basis of a linguistic study of the KPWr corpus. The study uses existing tools for Polish coreference resolution as sources of partial coreferential clusters containing pronoun, noun and named entity mentions. The same tools also serve as baseline zero coreference resolution systems for comparison with our system. The evaluation focuses not only on clustering correctness irrespective of mention type, measured with the standard CoNLL-2012 metrics, but also on the informativeness of the resulting relations. Under the coreference annotation approach used for the KPWr corpus, only named entities are treated as mentions informative enough to constitute a link to real-world objects. Consequently, we provide an evaluation of informativeness based on the links found between zero anaphoras and named entities. For the same reason, we restrict coreference resolution in this study to mention clusters built around named entities.

    Towards an event annotated corpus of Polish

    The paper presents a typology of events built on the basis of the TimeML specification and adapted to the Polish language. Some changes were introduced to the definitions of the event categories, and a motivation for the event categorization was formulated. The event annotation task is presented on two levels: the ontology level (language independent) and the level of text mentions (language dependent). The various types of event mentions in Polish text are discussed. A procedure for annotating event mentions in Polish texts is presented and evaluated. In the evaluation, a randomly selected set of documents from the Corpus of Wrocław University of Technology (KPWr) was annotated by two linguists and the inter-annotator agreement was calculated. The evaluation was done in two iterations. After the first evaluation we revised and improved the annotation procedure. The second evaluation showed a significant improvement in the agreement between annotators. The current work focused on the annotation and categorisation of event mentions in text. Future work will focus on describing events with a set of attributes, arguments and relations.

    Temporal Expressions in Polish Corpus KPWr

    This article presents the results of recent research on the interpretation of Polish expressions that refer to time. These expressions are the source of information about when something happens, how often something occurs or how long something lasts. Temporal information that can be extracted from text automatically plays a significant role in many information extraction applications, such as question answering, discourse analysis and event recognition. We prepared PLIMEX, a broad description of Polish temporal expressions with annotation guidelines, based on state-of-the-art solutions for English, mainly the TimeML specification. We also adapted the solution to capture the local semantics of temporal expressions, called LTIMEX. Temporal description also supports further event identification and extends the event description model, focusing on anchoring events in time, ordering events and reasoning about the persistence of events. We prepared a specification designed to address these issues and annotated all documents in the Polish Corpus of Wrocław University of Technology (KPWr) using our annotation guidelines.

    The Second Cross-Lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic Languages

    We describe the Second Multilingual Named Entity Challenge in Slavic languages. The task is recognizing mentions of named entities in Web documents, their normalization, and cross-lingual linking. The Challenge was organized as part of the 7th Balto-Slavic Natural Language Processing Workshop, co-located with the ACL-2019 conference. Eight teams participated in the competition, which covered four languages and five entity types. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all four languages, and five teams participated in the cross-lingual entity linking task. Detailed evaluation information is available on the shared task web page. Non peer reviewed.

    Pattern Acquisition Methods for Information Extraction Systems

    This master's thesis deals with Event Recognition in the reports of Polish stockholders. Event Recognition is one of the Information Extraction tasks. The thesis compares two approaches to Event Recognition: manual and automatic. The manual approach uses regular expressions, which also serve as the baseline for the automatic approach. In the automatic approach, three Machine Learning methods were applied. The initial experiment compares Decision Trees, naive Bayes and Memory Based Learning. A modification of the standard Memory Based Learning method is presented, whose goal is to create a classifier that uses only positive examples in the classification task. The performance of the modified Memory Based Learning method is presented and compared to the baseline and to the other Machine Learning methods. The initial experiment uses one type of annotation, the meeting date. The final experiment uses three types of annotation: the meeting time, the meeting date and the meeting place. The experiments show that classification can be performed using only one class of instances with the same level of performance.

    Towards Recognition of Spatial Relations between Entities for Polish

    In this paper, the problem of spatial relation recognition in Polish is examined. We present the different ways in which spatial information is distributed throughout a sentence by reviewing the lexical and grammatical signals of various relations between objects. We focus on the spatial usage of prepositions and their meaning, determined by the 'conceptual' schemes they constitute. We also discuss the feasibility of a comprehensive recognition of spatial relations between objects expressed in different ways by reviewing the existing tools and resources for text processing in Polish. As a result, we propose a heuristic method for the recognition of spatial relations expressed in various phrase structures called spatial expressions. We propose a definition of spatial expressions that takes into account the limitations of the available tools for the Polish language. A set of rules is used to generate candidate spatial expressions, which are then tested against a set of semantic constraints. The results of our work on the recognition of spatial expressions in Polish texts were partially presented in (Marcińczuk, Oleksy, & Wieczorek, 2016). That paper focused on a detailed analysis of the errors obtained using a set of basic morphosyntactic patterns for generating spatial expression candidates: we identified and described the most common sources of errors, i.e. incorrectly recognized or unrecognized expressions. In this paper we focus mainly on the preliminary stages of spatial expression recognition. We present an extensive review of how spatial information can be encoded in text, the types of spatial triggers in Polish, and a detailed evaluation of the morphosyntactic patterns that can be used to generate spatial expression candidates.
